Brazilian E-Commerce Public Database by Olist https://www.kaggle.com/olistbr/brazilian-ecommerce?select=olist_order_items_dataset.csv
A database foi provida pela Olist, uma empresa de marketplace situada no Brasil. A Olist conecta pequenos negócios por todo o Brasil de uma forma simples. Vendedores conseguem vender seus produtos diretamente pela Olist Store, as entregas são feitas por meio de parceiros.
Quais são os principais fatores que alavancam o número de vendas? Quais os fatores que impactam na avaliação do cliente?
Esse é o segundo notebook, onde iremos realizar clusterização e analisar os mesmos afim de promover insights. No primeiro notebook com o prefixo '01', realizamos toda a visualização do conjunto, onde foram retirados alguns insights que serão utilizados ao longo das proximas etapas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import re
import plotly.graph_objects as go
import dash
from warnings import simplefilter
from functools import reduce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from math import ceil
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score
from dash import dash_table
from IPython.core.display import HTML, display
# Versões dos pacotes usados neste jupyter notebook
%reload_ext watermark
%watermark -a "Herikc Brecher e João Hutner" --iversions
Author: Herikc Brecher e João Hutner plotly : 5.1.0 pandas : 1.2.4 numpy : 1.19.5 dash : 2.0.0 seaborn : 0.11.1 re : 2.2.1 matplotlib: 3.2.0
simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
sns.set_theme()
# Seed de Aleatoriedade
seed_ = 194
np.random.seed(seed_)
# Carregamento de todos datasets
dtCustomers = pd.read_csv('../data/olist_customers_dataset.csv', encoding = 'utf8', dtype={'customer_zip_code_prefix': str})
dtGeolocation = pd.read_csv('../data/olist_geolocation_dataset.csv', encoding = 'utf8', dtype={'geolocation_zip_code_prefix': str})
dtOrderItems = pd.read_csv('../data/olist_order_items_dataset.csv', encoding = 'utf8')
dtOrderPayments = pd.read_csv('../data/olist_order_payments_dataset.csv', encoding = 'utf8')
dtOrderReviews = pd.read_csv('../data/olist_order_reviews_dataset.csv', encoding = 'utf8')
dtOrders = pd.read_csv('../data/olist_orders_dataset.csv', encoding = 'utf8')
dtProducts = pd.read_csv('../data/olist_products_dataset.csv', encoding = 'utf8')
dtSellers = pd.read_csv('../data/olist_sellers_dataset.csv', encoding = 'utf8')
Iremos adicionar uma variavel extra para todos os datasets que contém o Estado. A variavel adicionada é a Região referente ao Estado.
regioes = {
'AC': 'Norte',
'AL': 'Nordeste',
'AP': 'Norte',
'AM': 'Norte',
'BA': 'Nordeste',
'CE': 'Nordeste',
'DF': 'CentroOeste',
'ES': 'Sudeste',
'GO': 'CentroOeste',
'MA': 'Nordeste',
'MT': 'CentroOeste',
'MS': 'CentroOeste',
'MG': 'Sudeste',
'PA': 'Norte',
'PB': 'Nordeste',
'PR': 'Sul',
'PE': 'Nordeste',
'PI': 'Nordeste',
'RJ': 'Sudeste',
'RN': 'Nordeste',
'RS': 'Sul',
'RO': 'Norte',
'RR': 'Norte',
'SC': 'Sul',
'SP': 'Sudeste',
'SE': 'Nordeste',
'TO': 'Norte'
}
dtCustomers['regiao'] = dtCustomers['customer_state'].copy()
dtCustomers = dtCustomers.replace({'regiao': regioes})
dtCustomers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtSellers['regiao'] = dtSellers['seller_state'].copy()
dtSellers = dtCustomers.replace({'regiao': regioes})
dtSellers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtGeolocation['regiao'] = dtGeolocation['geolocation_state'].copy()
dtGeolocation = dtGeolocation.replace({'regiao': regioes})
dtGeolocation.head()
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 01037 | -23.545621 | -46.639292 | sao paulo | SP | Sudeste |
| 1 | 01046 | -23.546081 | -46.644820 | sao paulo | SP | Sudeste |
| 2 | 01046 | -23.546129 | -46.642951 | sao paulo | SP | Sudeste |
| 3 | 01041 | -23.544392 | -46.639499 | sao paulo | SP | Sudeste |
| 4 | 01035 | -23.541578 | -46.641607 | sao paulo | SP | Sudeste |
Devido ao problema em questão ser dividido em 8 datasets diferentes, iremos construir datasets mais unificados afim de facilitar o processo de tratamento dos dados e clusterização.
dtCustomers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtGeolocation.head()
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 01037 | -23.545621 | -46.639292 | sao paulo | SP | Sudeste |
| 1 | 01046 | -23.546081 | -46.644820 | sao paulo | SP | Sudeste |
| 2 | 01046 | -23.546129 | -46.642951 | sao paulo | SP | Sudeste |
| 3 | 01041 | -23.544392 | -46.639499 | sao paulo | SP | Sudeste |
| 4 | 01035 | -23.541578 | -46.641607 | sao paulo | SP | Sudeste |
dtOrderItems.head()
| order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
dtOrderPayments.head()
| order_id | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
dtOrderReviews.head()
| review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
dtOrders.head()
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
dtProducts.head()
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
dtSellers.head()
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
listDtMergeOrders = [dtOrderItems[['order_id', 'freight_value']], dtOrderPayments, dtOrderReviews, dtOrders,\
dtOrderItems.groupby('order_id').agg({'order_item_id': max}).reset_index()]
dtGeneralOrders = reduce(lambda left, right: pd.merge(left, right, on = 'order_id'), listDtMergeOrders)
dtGeneralOrders = pd.merge(dtGeneralOrders, dtCustomers[['customer_id', 'customer_city', 'customer_state', 'regiao']],\
on = 'customer_id')
dtGeneralOrders.head()
| order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_city | customer_state | regiao | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | delivered | 2017-09-13 08:59:02 | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 23:43:48 | 2017-09-29 00:00:00 | 1 | campos dos goytacazes | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | delivered | 2017-04-26 10:53:06 | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 16:04:24 | 2017-05-15 00:00:00 | 1 | santa fe do sul | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | delivered | 2018-01-14 14:33:31 | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 13:19:16 | 2018-02-05 00:00:00 | 1 | para de minas | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | delivered | 2018-08-08 10:00:35 | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 13:32:39 | 2018-08-20 00:00:00 | 1 | atibaia | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | delivered | 2017-02-04 13:57:51 | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 16:42:31 | 2017-03-17 00:00:00 | 1 | varzea paulista | SP | Sudeste |
5 rows × 23 columns
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_purchase_timestamp'] = \
pd.to_datetime(dtGeneralOrders['order_purchase_timestamp'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_delivered_customer_date'] = \
pd.to_datetime(dtGeneralOrders['order_delivered_customer_date'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
# Mantendo somente Ano, Mes e Dia
dtGeneralOrders['order_estimated_delivery_date'] = \
pd.to_datetime(dtGeneralOrders['order_estimated_delivery_date'], format = '%Y-%m-%d %H:%M:%S').dt.to_period('d')
dtGeneralOrders['diff_delivery_and_estimate'] = dtGeneralOrders['order_estimated_delivery_date'] -\
dtGeneralOrders['order_delivered_customer_date']
dtGeneralOrders['diff_delivery_and_purchase'] = dtGeneralOrders['order_purchase_timestamp'] -\
dtGeneralOrders['order_delivered_customer_date']
dtGeneralOrders['diff_delivery_and_estimate'] = \
dtGeneralOrders['diff_delivery_and_estimate'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_purchase'] = \
dtGeneralOrders['diff_delivery_and_purchase'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_estimate'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_estimate'])
dtGeneralOrders['diff_delivery_and_purchase'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_purchase'])
dtGeneralOrders.head()
| order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_city | customer_state | regiao | diff_delivery_and_estimate | diff_delivery_and_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 | 2017-09-29 | 1 | campos dos goytacazes | RJ | Sudeste | 9.0 | 7.0 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 | 2017-05-15 | 1 | santa fe do sul | SP | Sudeste | 3.0 | 16.0 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 | 2018-02-05 | 1 | para de minas | MG | Sudeste | 14.0 | 8.0 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 | 2018-08-20 | 1 | atibaia | SP | Sudeste | 6.0 | 6.0 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 | 2017-03-17 | 1 | varzea paulista | SP | Sudeste | 16.0 | 25.0 |
5 rows × 25 columns
dtGeneralOrders = dtGeneralOrders[['order_id', 'freight_value', 'payment_type', 'payment_installments', 'payment_value',\
'review_score', 'order_status', 'order_item_id',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase',\
'customer_state', 'regiao']]
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | order_status | order_item_id | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | regiao | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | delivered | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | delivered | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | delivered | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | delivered | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | delivered | 1 | 16.0 | 25.0 | SP | Sudeste |
dtGeneralOrders = dtGeneralOrders[dtGeneralOrders['order_status'] == 'delivered']
print(dtGeneralOrders.isna().sum())
order_id 0 freight_value 0 payment_type 0 payment_installments 0 payment_value 0 review_score 0 order_status 0 order_item_id 0 diff_delivery_and_estimate 1674 diff_delivery_and_purchase 8 customer_state 0 regiao 0 dtype: int64
# Dropando valores NA
dtGeneralOrders = dtGeneralOrders.dropna()
dtGeneralOrders = dtGeneralOrders.drop('order_status', axis = 1)
dtGeneralOrders.shape
(114054, 11)
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | order_item_id | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | regiao | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste |
dtGeneralOrders = dtGeneralOrders.rename(columns = {'order_item_id': 'quantity', 'regiao': 'customer_region'})
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste |
dtGeneralOrders.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste |
dtGeneralOrdersProcessado = dtGeneralOrders.copy()
# Realizando LabelEncoder
labelencoderPaymentType = LabelEncoder()
labelencoderCustomerState = LabelEncoder()
labelencoderCustomerRegion = LabelEncoder()
dtGeneralOrdersProcessado['payment_type'] = labelencoderPaymentType.fit_transform(dtGeneralOrders['payment_type'])
dtGeneralOrdersProcessado['customer_state'] =\
labelencoderCustomerState.fit_transform(dtGeneralOrders['customer_state'])
dtGeneralOrdersProcessado['customer_region'] =\
labelencoderCustomerRegion.fit_transform(dtGeneralOrders['customer_region'])
dtGeneralOrdersProcessado.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | 18 | 3 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | 25 | 3 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | 10 | 3 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | 25 | 3 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | 25 | 3 |
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase']
def boxplot_individuais(data, columns, width = 15, height = 8):
fig = plt.figure()
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
fig.set_figheight(8)
fig.set_figwidth(15)
columns_adjust = ceil(len(columns) / 3)
for i, column in enumerate(columns):
ax = fig.add_subplot(columns_adjust, 3, i + 1)
sns.boxplot(x = data[column])
plt.tight_layout()
plt.show()
def hist_individual(data, columns, width = 10, height = 15):
fig = plt.figure()
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
fig.set_figheight(10)
fig.set_figwidth(15)
columns_adjust = ceil(len(columns) / 3)
for i, column in enumerate(columns):
ax = fig.add_subplot(columns_adjust, 3, i + 1)
data[column].hist(label = column)
plt.title(column)
plt.tight_layout()
plt.show()
boxplot_individuais(dtGeneralOrdersProcessado, colunas_quantitativas)
hist_individual(dtGeneralOrdersProcessado, colunas_quantitativas)
scaler = StandardScaler()
#scaler = MinMaxScaler()
dtGeneralOrdersNormalizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersNormalizado[colunas_quantitativas] = scaler.fit_transform(dtGeneralOrdersProcessado[colunas_quantitativas])
dtGeneralOrdersNormalizado.head()
| order_id | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | -0.426745 | 1 | 2 | -0.375092 | 5 | -0.354355 | -0.564103 | -0.567311 | 18 | 3 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | -0.005297 | 1 | 3 | 0.329257 | 4 | -0.354355 | -1.298337 | 0.385421 | 25 | 3 |
| 2 | 000229ec398224ef6ca0657da4fc703e | -0.136047 | 1 | 5 | 0.167997 | 5 | -0.354355 | 0.047760 | -0.461452 | 10 | 3 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | -0.458481 | 1 | 2 | -0.549303 | 4 | -0.354355 | -0.931220 | -0.673170 | 25 | 3 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | -0.118910 | 1 | 3 | 0.172389 | 5 | -0.354355 | 0.292504 | 1.338153 | 25 | 3 |
boxplot_individuais(dtGeneralOrdersNormalizado, colunas_quantitativas)
hist_individual(dtGeneralOrdersNormalizado, colunas_quantitativas)
print(dtGeneralOrdersProcessado[colunas_quantitativas].skew(),\
'\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].skew())))
freight_value 5.558939 payment_value 14.448540 quantity 6.331864 diff_delivery_and_estimate 2.846043 diff_delivery_and_purchase 3.912217 dtype: float64 Soma: 33.09760296622308
print(dtGeneralOrdersNormalizado[colunas_quantitativas].skew(),\
'\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].skew())))
freight_value 5.558939 payment_value 14.448540 quantity 6.331864 diff_delivery_and_estimate 2.846043 diff_delivery_and_purchase 3.912217 dtype: float64 Soma: 33.09760296622308
print(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis(),\
'\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis())))
freight_value 58.300861 payment_value 530.034479 quantity 68.366469 diff_delivery_and_estimate 32.528515 diff_delivery_and_purchase 40.794495 dtype: float64 Soma: 730.0248192377511
print(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis(),\
'\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis())))
freight_value 58.300861 payment_value 530.034479 quantity 68.366469 diff_delivery_and_estimate 32.528515 diff_delivery_and_purchase 40.794495 dtype: float64 Soma: 730.024819237751
dtGeneralOrdersFinal = dtGeneralOrdersNormalizado.drop('order_id', axis = 1)
dtGeneralOrdersFinal.head()
| freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.426745 | 1 | 2 | -0.375092 | 5 | -0.354355 | -0.564103 | -0.567311 | 18 | 3 |
| 1 | -0.005297 | 1 | 3 | 0.329257 | 4 | -0.354355 | -1.298337 | 0.385421 | 25 | 3 |
| 2 | -0.136047 | 1 | 5 | 0.167997 | 5 | -0.354355 | 0.047760 | -0.461452 | 10 | 3 |
| 3 | -0.458481 | 1 | 2 | -0.549303 | 4 | -0.354355 | -0.931220 | -0.673170 | 25 | 3 |
| 4 | -0.118910 | 1 | 3 | 0.172389 | 5 | -0.354355 | 0.292504 | 1.338153 | 25 | 3 |
dtGeneralOrdersFinal[dtGeneralOrdersFinal.columns.difference(colunas_quantitativas)].head()
| customer_region | customer_state | payment_installments | payment_type | review_score | |
|---|---|---|---|---|---|
| 0 | 3 | 18 | 2 | 1 | 5 |
| 1 | 3 | 25 | 3 | 1 | 4 |
| 2 | 3 | 10 | 5 | 1 | 5 |
| 3 | 3 | 25 | 2 | 1 | 4 |
| 4 | 3 | 25 | 3 | 1 | 5 |
# Determinando um range de K
k_range = range(1, 12)
# Aplicando o modelo K-Means para cada valor de K
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersFinal) for k in k_range]
# Ajustando o centróide do cluster para cada modelo
intertia = [X.inertia_ for X in k_means_var]
Analisando a metrica de Elbow é verificado que o numero ideal para K seria 6. Pois, a queda da distancia passa a diminuir fortemente em K = 4, e em 6 tem uma redução brusca novamente.
# Lista de valores de Inertia (Inertia e WCSS são a mesma coisa)
inertia = []
# Loop para testar os valores de K
for n in range(2 , 12):
modelo = (KMeans(n_clusters = n,
init = 'k-means++',
n_init = 10,
max_iter = 300,
tol = 0.0001,
random_state = seed_,
algorithm = 'elkan'))
modelo.fit(dtGeneralOrdersFinal)
inertia.append(modelo.inertia_)
# Plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 12) , inertia , 'o')
plt.plot(np.arange(2 , 12) , inertia , '-' , alpha = 0.5)
plt.xlabel('Número de Clusters') , plt.ylabel('Inertia')
plt.show()
# Lista de valores de Inertia (Inertia e WCSS são a mesma coisa)
inertia = []
# Loop para testar os valores de K
for n in range(2 , 12):
modelo = (KMeans(n_clusters = n,
init = 'k-means++',
n_init = 10,
max_iter = 300,
tol = 0.0001,
random_state = seed_))
modelo.fit(dtGeneralOrdersFinal)
inertia.append(modelo.inertia_)
# Plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 12) , inertia , 'o')
plt.plot(np.arange(2 , 12) , inertia , '-' , alpha = 0.5)
plt.xlabel('Número de Clusters') , plt.ylabel('Inertia')
plt.savefig('export/curva_elbow.png')
plt.show()
dtGeneralOrdersTemp = dtGeneralOrdersFinal[:57028].copy()
# Determinando um range de K
k_range = range(1, 10)
# Aplicando o modelo K-Means para cada valor de K (esta célula pode levar bastante tempo para ser executada)
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersTemp) for k in k_range]
# Ajustando o centróide do cluster para cada modelo
centroids = [X.cluster_centers_ for X in k_means_var]
# Calculando a distância euclidiana de cada ponto de dado para o centróide
k_euclid = [cdist(dtGeneralOrdersTemp, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke, axis = 1) for ke in k_euclid]
# Soma dos quadrados das distâncias dentro do cluster
soma_quadrados_intra_cluster = [sum(d**2) for d in dist]
# Soma total dos quadrados
soma_total = sum(pdist(dtGeneralOrdersTemp)**2)/dtGeneralOrdersTemp.shape[0]
# Soma dos quadrados entre clusters
soma_quadrados_inter_cluster = soma_total - soma_quadrados_intra_cluster
Novamente verificando a variancia dos dados de acordo com cada valor de K, encontramos que o valor ideal é 6.
# Curva de Elbow
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(k_range, soma_quadrados_inter_cluster/soma_total * 100, 'b*-')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Número de Clusters')
plt.ylabel('Percentual de Variância Explicada')
plt.title('Variância Explicada x Valor de K')
plt.savefig('export/variancia_explicada.png')
plt.show()
def testa_k(pca_data, k = 2):
# Criando um modelo
modelo = KMeans(n_clusters = k)
modelo.fit(pca_data)
# Obtém os valores mínimos e máximos e organiza o shape
x_min, x_max = pca_data[:, 0].min() - 5, pca_data[:, 0].max() - 1
y_min, y_max = pca_data[:, 1].min() + 1, pca_data[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
Z = modelo.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot das áreas dos clusters
plt.figure(figsize = (8, 8))
plt.clf()
plt.imshow(Z,
interpolation = 'nearest',
extent = (xx.min(), xx.max(), yy.min(), yy.max()),
cmap = plt.cm.Paired,
aspect = 'auto',
origin = 'lower')
plt.savefig('export/meshgrid_' + str(k) + 'k.png')
plt.show()
# Plot dos centróides
plt.figure(figsize = (13, 13))
plt.plot(pca_data[:, 0], pca_data[:, 1], 'k.', markersize = 4)
centroids = modelo.cluster_centers_
inert = modelo.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x', s = 169, linewidths = 5, color = 'r', zorder = 8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.savefig('export/centroides_' + str(k) + 'k.png')
plt.show()
# Silhouette Score
labels = modelo.labels_
print(f"Silhouette Score: {silhouette_score(pca_data, labels, metric = 'euclidean')}")
# Aplica redução de dimensionalidade
pca = PCA(n_components = 2).fit_transform(dtGeneralOrdersFinal)
testa_k(pca, 4)
Silhouette Score: 0.6076782177874883
testa_k(pca, 6)
Silhouette Score: 0.6099490650479382
testa_k(pca, 8)
Silhouette Score: 0.5776603331167324
testa_k(pca, 10)
Silhouette Score: 0.6098466127165733
def treina_modelo(k, data, silhouette = True):
# Criando um modelo
modelo = KMeans(n_clusters = k)
modelo.fit(data)
if silhouette:
# Silhouette Score
labels = modelo.labels_
print(f"Silhouette Score: {silhouette_score(data, labels, metric = 'euclidean')}")
# Modelo com k = 6
treina_modelo(6, dtGeneralOrdersFinal)
Silhouette Score: 0.3974316315320899
# Modelo com k = 8
treina_modelo(8, dtGeneralOrdersFinal)
Silhouette Score: 0.3184694372738076
# Criando um modelo
modelo = KMeans(n_clusters = 8)
modelo.fit(dtGeneralOrdersFinal)
KMeans()
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.drop('order_id', axis = 1)
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
dtGeneralOrdersClusterizado['payment_type'] =\
labelencoderPaymentType.inverse_transform(dtGeneralOrdersClusterizado['payment_type'])
dtGeneralOrdersClusterizado['customer_state'] =\
labelencoderCustomerState.inverse_transform(dtGeneralOrdersClusterizado['customer_state'])
dtGeneralOrdersClusterizado['customer_region'] =\
labelencoderCustomerRegion.inverse_transform(dtGeneralOrdersClusterizado['customer_region'])
metodos_pagamento = {
'credit_card': 'Credito',
'debit_card': 'Debito'
}
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.replace({'payment_type': metodos_pagamento})
dtGeneralOrdersClusterizado.head()
| freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.29 | Credito | 2 | 72.19 | 5 | 1 | 9.0 | 7.0 | RJ | Sudeste | 2 |
| 1 | 19.93 | Credito | 3 | 259.83 | 4 | 1 | 3.0 | 16.0 | SP | Sudeste | 5 |
| 2 | 17.87 | Credito | 5 | 216.87 | 5 | 1 | 14.0 | 8.0 | MG | Sudeste | 6 |
| 3 | 12.79 | Credito | 2 | 25.78 | 4 | 1 | 6.0 | 6.0 | SP | Sudeste | 0 |
| 4 | 18.14 | Credito | 3 | 218.04 | 5 | 1 | 16.0 | 25.0 | SP | Sudeste | 5 |
dtClusters = pd.DataFrame()
dtClusters['Preco_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_value'].mean(), 2)
dtClusters['Frete_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['freight_value'].mean(), 2)
dtClusters['Media_Pagamentos'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_installments'].mean(), 2)
dtClusters['Score_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['review_score'].mean(), 2)
dtClusters['Itens_Medio'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['quantity'].mean(), 2)
dtClusters['Dif_Estimado_Entregue'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['diff_delivery_and_estimate'].mean(), 2)
dtClusters['Dif_Compra_Entregue'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['diff_delivery_and_purchase'].mean(), 2)
dtClusters['Pagamento_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['payment_type'].agg(lambda x: x.value_counts().index[0])
dtClusters['Estado_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_state'].agg(lambda x: x.value_counts().index[0])
dtClusters['Regiao_Prevalente'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_region'].agg(lambda x: x.value_counts().index[0])
display(HTML("<style>.container { width:100% !important; }</style>"))
dtClusters
| Preco_Medio | Frete_Medio | Media_Pagamentos | Score_Medio | Itens_Medio | Dif_Estimado_Entregue | Dif_Compra_Entregue | Pagamento_Prevalente | Estado_Prevalente | Regiao_Prevalente | |
|---|---|---|---|---|---|---|---|---|---|---|
| cluster | ||||||||||
| 0 | 121.76 | 15.29 | 1.18 | 4.18 | 1.37 | 12.27 | 9.67 | Credito | SP | Sudeste |
| 1 | 154.86 | 23.89 | 2.02 | 3.96 | 1.29 | 13.90 | 17.23 | Credito | BA | Nordeste |
| 2 | 143.05 | 21.42 | 1.69 | 3.97 | 1.32 | 15.10 | 14.77 | Credito | RJ | Sudeste |
| 3 | 342.22 | 22.54 | 9.02 | 4.00 | 1.74 | 13.59 | 10.72 | Credito | SP | Sudeste |
| 4 | 318.29 | 28.49 | 7.58 | 3.87 | 1.52 | 15.64 | 15.49 | Credito | RJ | Sudeste |
| 5 | 167.62 | 17.30 | 4.10 | 4.12 | 1.40 | 13.04 | 10.01 | Credito | SP | Sudeste |
| 6 | 152.00 | 22.12 | 1.93 | 4.08 | 1.34 | 14.28 | 13.64 | Credito | MG | Sudeste |
| 7 | 347.74 | 29.81 | 8.30 | 3.94 | 1.66 | 14.91 | 14.86 | Credito | MG | Sudeste |
dtClusters.to_csv('analise_clusters.csv')
dtClustersIndexado = dtClusters.reset_index()
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig = go.Figure(data=[go.Table(
header=dict(
values=list(dtClustersIndexado.columns),
line_color='darkslategray',
fill_color=headerColor,
align=['left','center'],
font=dict(color='white', size=12)
),
cells=dict(
values=[dtClustersIndexado.Cluster, dtClustersIndexado.Preco_Medio, dtClustersIndexado.Frete_Medio, dtClustersIndexado.Media_Pagamentos,\
dtClustersIndexado.Score_Medio, dtClustersIndexado.Itens_Medio, dtClustersIndexado.Dif_Estimado_Entregue,\
dtClustersIndexado.Dif_Compra_Entregue, dtClustersIndexado.Pagamento_Prevalente, dtClustersIndexado.Estado_Prevalente,\
dtClustersIndexado.Regiao_Prevalente],
line_color='darkslategray',
fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
align = ['left', 'center'],
font = dict(color = 'darkslategray', size = 11)
))
])
fig.show()